The usual scenario for learning tasks such as those presented in this book includes a list of instances (represented as feature/value pairs) and a special feature (the target class) that we want to predict for future instances based on the values of the remaining features. However, the source data does not usually come in this format. We have to extract what we think are potentially useful features and convert them to our learning format. This process is called feature extraction or feature engineering, and it is an often underestimated but very important and time-consuming phase in most real-world machine learning tasks.
Start by importing numpy, scikit-learn, pandas, and pyplot, the Python libraries we will be using in this chapter. Print the versions in use, which helps if you have problems running the notebooks.
In [1]:
%pylab inline
import IPython
import sklearn as sk
import numpy as np
import matplotlib
import pandas as pd
import matplotlib.pyplot as plt
print 'IPython version:', IPython.__version__
print 'numpy version:', np.__version__
print 'scikit-learn version:', sk.__version__
print 'matplotlib version:', matplotlib.__version__
print 'pandas version:', pd.__version__
The Python package pandas (http://pandas.pydata.org/) provides data structures and tools for data analysis. It aims to provide features similar to those of R, the popular language and environment for statistical computing. We will use pandas to import the Titanic data we presented in Chapter 2, Supervised Learning, and convert it to the scikit-learn format.
In [2]:
titanic = pd.read_csv('data/titanic.csv')
print titanic
You can see that each CSV column has a corresponding feature in the DataFrame, and that the feature type is inferred from the available data. We can inspect some features to see what they look like.
In [3]:
print titanic.head()[['pclass', 'survived', 'age', 'embarked', 'boat', 'sex']]
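The type inference mentioned above can be seen on a small in-memory CSV. This is only a minimal sketch: the three rows are made up for illustration and are not taken from the Titanic file, and the code is written with function-style print so that it also runs on Python 3.

```python
import io
import pandas as pd

# Hypothetical three-row CSV; the second row has no age value
csv_text = u"pclass,survived,age\n1st,1,29.0\n2nd,0,\n3rd,0,25.0\n"
toy = pd.read_csv(io.StringIO(csv_text))

# Integer columns stay integers, a numeric column with a gap becomes
# floating point (the gap is stored as NaN), and text stays text
print(toy.dtypes)
print(toy.isnull().sum())
```

The missing age is what forces that column to a floating-point type, which is why missing values have to be handled before training.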
In [4]:
titanic.describe()
Out[4]:
The main difficulty we have now is that scikit-learn methods expect real numbers as feature values. In Chapter 2, Supervised Learning, we used the LabelEncoder and OneHotEncoder preprocessing methods to manually convert certain categorical features into 1-of-K values (generating a new feature for each possible value, valued 1 if the original feature had the corresponding value and 0 otherwise). This time, we will use a similar scikit-learn method, DictVectorizer, which automatically builds these features from the different original feature values. Moreover, we will write a function that encodes a set of columns in a single step.
In [5]:
from sklearn import feature_extraction
def one_hot_dataframe(data, cols, replace=False):
    """ Takes a dataframe and a list of columns that need to be encoded.
    Returns a 2-tuple comprising the data and the vectorized data.
    Modified from https://gist.github.com/kljensen/5452382
    """
    vec = feature_extraction.DictVectorizer()
    # Create a dictionary for each row and vectorize the selected columns
    # ('orient' was named 'outtype' in older pandas releases)
    vecData = pd.DataFrame(vec.fit_transform(data[cols].to_dict(orient='records')).toarray())
    vecData.columns = vec.get_feature_names()
    vecData.index = data.index
    if replace:
        # Swap the original columns for their encoded counterparts
        data = data.drop(cols, axis=1)
        data = data.join(vecData)
    return (data, vecData)
titanic, titanic_n= one_hot_dataframe(titanic, ['pclass', 'embarked', 'sex'], replace=True)
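To see what this encoding produces, here is a minimal sketch of DictVectorizer on two made-up passengers. The dictionaries are hypothetical and only illustrate the naming of the generated columns; `feature_names_` is the fitted attribute that holds them.

```python
from sklearn.feature_extraction import DictVectorizer

# Hypothetical passengers, one dictionary per row
rows = [
    {'sex': 'female', 'embarked': 'Southampton', 'age': 29.0},
    {'sex': 'male', 'embarked': 'Cherbourg', 'age': 40.0},
]

vec = DictVectorizer(sparse=False)
encoded = vec.fit_transform(rows)

# String values become one 0/1 column per distinct value ("name=value");
# numeric values such as age are passed through unchanged
print(vec.feature_names_)
print(encoded)
```

Each categorical value gets its own column, so a feature with many distinct values (such as a ticket number) expands into many sparse columns.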
In [6]:
titanic.describe()
Out[6]:
What does the 'embarked' feature contain?
In [7]:
print titanic_n.head(5)
print titanic_n[titanic_n['embarked'] != 0].head()
Convert the remaining categorical features...
In [8]:
print titanic.head()
titanic, titanic_n = one_hot_dataframe(titanic, ['home.dest', 'room', 'ticket', 'boat'], replace=True)
We also have to deal with missing values, since the DecisionTreeClassifier we plan to use does not accept them as input. Pandas allows us to replace them with a fixed value using the fillna method. We will use the mean age for the age feature, and 0 for the remaining missing attributes.
Replace missing ages with the mean age
In [9]:
print titanic['age'].describe()
mean = titanic['age'].mean()
titanic['age'] = titanic['age'].fillna(mean)
print titanic['age'].describe()
Fill the remaining missing values with zeros
In [10]:
titanic.fillna(0, inplace=True)
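The two imputation steps can be checked on a toy frame. This is a sketch under stated assumptions: the column `other` is a hypothetical stand-in for any non-age feature, and the values are invented.

```python
import numpy as np
import pandas as pd

# Hypothetical data with one gap in each column
toy = pd.DataFrame({
    'age': [20.0, np.nan, 40.0],
    'other': [7.5, np.nan, 30.0],
})

# Step 1: the missing age becomes the mean of the known ages (30.0)
toy['age'] = toy['age'].fillna(toy['age'].mean())
# Step 2: every remaining gap becomes 0
toy = toy.fillna(0)
print(toy)
```

The order matters: filling everything with 0 first would leave no missing ages to replace and would also pull the mean age down.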
In [11]:
print titanic
Build the training and testing dataset
In [12]:
from sklearn.cross_validation import train_test_split
titanic_target = titanic['survived']
titanic_data = titanic.drop(['name', 'row.names', 'survived'], axis=1)
X_train, X_test, y_train, y_test = train_test_split(titanic_data, titanic_target, test_size=0.25, random_state=33)
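In scikit-learn 0.18 and later, train_test_split is provided by sklearn.model_selection, and the sklearn.cross_validation module was removed in 0.20. The following sketch shows the same split on made-up data with the newer import; the `toy_` names are hypothetical and chosen so that they do not overwrite the notebook's variables.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical data: 20 instances with 2 features and a binary target
toy_X = np.arange(40).reshape(20, 2)
toy_y = np.array([0, 1] * 10)

# Same arguments as in the cell above: 25% of the rows go to the test set
toy_X_train, toy_X_test, toy_y_train, toy_y_test = train_test_split(
    toy_X, toy_y, test_size=0.25, random_state=33)

print("train: {0}, test: {1}".format(toy_X_train.shape, toy_X_test.shape))
```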
Let's see how a decision tree works with the current feature set.
In [13]:
from sklearn import tree
dt = tree.DecisionTreeClassifier(criterion='entropy')
dt = dt.fit(X_train, y_train)
In [14]:
import pydot, StringIO
dot_data = StringIO.StringIO()
tree.export_graphviz(dt, out_file=dot_data, feature_names=titanic_data.columns)
graph = pydot.graph_from_dot_data(dot_data.getvalue())
graph.write_png('titanic.png')
from IPython.core.display import Image
Image(filename='titanic.png')
Out[14]:
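If Graphviz or pydot is not available, a plain-text rendering of the tree is a possible alternative. The sketch below assumes scikit-learn 0.21 or later, where tree.export_text exists; the data and feature names are hypothetical, and the class simply equals the first feature.

```python
from sklearn import tree

# Hypothetical two-feature data set: the class equals the first feature
toy_X = [[0, 0], [0, 1], [1, 0], [1, 1]]
toy_y = [0, 0, 1, 1]

toy_dt = tree.DecisionTreeClassifier(criterion='entropy', random_state=0)
toy_dt.fit(toy_X, toy_y)

# One line per node: the test on an internal node, the class on a leaf
rules = tree.export_text(toy_dt, feature_names=['sex=female', 'age'])
print(rules)
```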
In [15]:
from sklearn import metrics
def measure_performance(X, y, clf, show_accuracy=True, show_classification_report=True, show_confussion_matrix=True):
    # Predict once and report the requested metrics
    y_pred = clf.predict(X)
    if show_accuracy:
        print "Accuracy:{0:.3f}".format(metrics.accuracy_score(y, y_pred)),"\n"
    if show_classification_report:
        print "Classification report"
        print metrics.classification_report(y, y_pred),"\n"
    if show_confussion_matrix:
        print "Confusion matrix"
        print metrics.confusion_matrix(y, y_pred),"\n"
In [16]:
from sklearn import metrics
measure_performance(X_test, y_test, dt, show_confussion_matrix=False, show_classification_report=False)
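To make the reported numbers concrete, here is a small sketch that computes the accuracy and the confusion matrix for five made-up predictions. The label vectors are hypothetical and unrelated to the Titanic results.

```python
from sklearn import metrics

# Hypothetical true and predicted classes for five instances
y_true = [0, 0, 1, 1, 1]
y_pred = [0, 1, 1, 1, 0]

# Three of the five predictions are right
accuracy = metrics.accuracy_score(y_true, y_pred)
# Rows are true classes, columns are predicted classes
matrix = metrics.confusion_matrix(y_true, y_pred)

print("Accuracy: {0:.3f}".format(accuracy))
print(matrix)
```

The diagonal of the matrix counts the correct predictions, so accuracy is the sum of the diagonal divided by the number of instances.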
Working with a smaller feature set may lead to better results, so we want a way to find the best features algorithmically. This task is called feature selection, and it is a crucial step when we aim to get decent results with machine learning algorithms: if the features are poor, the results will be poor no matter how sophisticated the learning algorithm is.
Select only the 20% most important features, using a chi2 test
In [17]:
from sklearn import feature_selection
fs = feature_selection.SelectPercentile(feature_selection.chi2, percentile=20)
X_train_fs = fs.fit_transform(X_train, y_train)
print titanic_data.columns[fs.get_support()]
print fs.scores_[2]
print titanic_data.columns[2]
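The effect of the chi2 test is easier to see on a toy matrix. In this sketch (made-up data, hypothetical `toy_` names) only the first of five binary features depends on the class, so keeping the top 20% should keep exactly that column.

```python
import numpy as np
from sklearn import feature_selection

# Hypothetical data: column 0 equals the class, the other columns have
# the same total in both classes and therefore carry no signal for chi2
toy_X = np.array([
    [0, 1, 1, 0, 1],
    [0, 0, 1, 1, 0],
    [0, 1, 0, 1, 0],
    [1, 1, 1, 1, 1],
    [1, 0, 1, 0, 0],
    [1, 1, 0, 1, 0],
])
toy_y = np.array([0, 0, 0, 1, 1, 1])

toy_fs = feature_selection.SelectPercentile(feature_selection.chi2, percentile=20)
toy_X_fs = toy_fs.fit_transform(toy_X, toy_y)

print(toy_fs.scores_)         # chi2 statistic per feature
print(toy_fs.get_support())   # boolean mask of the kept features
```

Note that chi2 requires non-negative feature values, which holds for the 0/1 columns produced by the one-hot encoding.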
Evaluate performance with the new feature set
In [18]:
dt.fit(X_train_fs, y_train)
X_test_fs = fs.transform(X_test)
measure_performance(X_test_fs, y_test, dt, show_confussion_matrix=False, show_classification_report=False)
Find the best percentile using cross-validation on the training set
In [19]:
from sklearn import cross_validation
percentiles = range(1, 100, 5)
results = []
for i in percentiles:
    # Keep the top i% of features and score the tree with 5-fold cross-validation
    fs = feature_selection.SelectPercentile(feature_selection.chi2, percentile=i)
    X_train_fs = fs.fit_transform(X_train, y_train)
    scores = cross_validation.cross_val_score(dt, X_train_fs, y_train, cv=5)
    results = np.append(results, scores.mean())
# Index of the first percentile that reaches the best mean score
optimal_percentil = int(np.argmax(results))
print "Optimal percentile of features:{0}".format(percentiles[optimal_percentil]), "\n"
# Plot percentile of features vs. cross-validation scores
import pylab as pl
pl.figure()
pl.xlabel("Percentile of features selected")
pl.ylabel("Cross-validation accuracy")
pl.plot(percentiles, results)
print "Mean scores:", results
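One design point worth noting: in the loop above the selector is fitted on the whole training set before cross-validation, so each validation fold has already influenced which features were kept. A possible alternative, sketched here on synthetic data and assuming scikit-learn 0.18 or later (for sklearn.model_selection), is to wrap the selector and the tree in a Pipeline so that selection is refitted inside every fold. All `toy_` names and the data are hypothetical.

```python
import numpy as np
from sklearn.feature_selection import SelectPercentile, chi2
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.tree import DecisionTreeClassifier

# Synthetic non-negative features; the class depends on the first two only
rng = np.random.RandomState(0)
toy_X = rng.randint(0, 5, size=(120, 10))
toy_y = (toy_X[:, 0] + toy_X[:, 1] > 4).astype(int)

toy_percentiles = list(range(10, 101, 10))
toy_scores = []
for p in toy_percentiles:
    # The selector is part of the estimator, so it only sees the
    # training portion of each cross-validation fold
    pipe = Pipeline([
        ('select', SelectPercentile(chi2, percentile=p)),
        ('tree', DecisionTreeClassifier(criterion='entropy', random_state=0)),
    ])
    toy_scores.append(cross_val_score(pipe, toy_X, toy_y, cv=5).mean())

toy_best = toy_percentiles[int(np.argmax(toy_scores))]
print("Best percentile on the toy data: {0}".format(toy_best))
```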
Evaluate the best percentile of features on the test set
In [20]:
fs = feature_selection.SelectPercentile(feature_selection.chi2, percentile=percentiles[optimal_percentil])
X_train_fs = fs.fit_transform(X_train, y_train)
dt.fit(X_train_fs, y_train)
X_test_fs = fs.transform(X_test)
measure_performance(X_test_fs, y_test, dt, show_confussion_matrix=False, show_classification_report=False)
In [21]:
print dt.get_params()
Compare the split criteria, using cross-validation on the training set (see next notebook on Model Selection)
In [22]:
dt = tree.DecisionTreeClassifier(criterion='entropy')
scores = cross_validation.cross_val_score(dt, X_train_fs, y_train, cv=5)
print "Entropy criterion accuracy on cv: {0:.3f}".format(scores.mean())
dt = tree.DecisionTreeClassifier(criterion='gini')
scores = cross_validation.cross_val_score(dt, X_train_fs, y_train, cv=5)
print "Gini criterion accuracy on cv: {0:.3f}".format(scores.mean())
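For reference, the two criteria differ in how they measure the impurity of the class labels in a node. The sketch below implements the standard textbook definitions (entropy as the sum of p * log2(1/p), Gini as 1 minus the sum of p squared); the function names are hypothetical and this is not scikit-learn's internal code.

```python
import numpy as np

def class_proportions(labels):
    # Fraction of the instances that falls in each class
    _, counts = np.unique(labels, return_counts=True)
    return counts / float(counts.sum())

def entropy_impurity(labels):
    p = class_proportions(labels)
    return float(np.sum(p * np.log2(1.0 / p)))

def gini_impurity(labels):
    p = class_proportions(labels)
    return float(1.0 - np.sum(p ** 2))

balanced = [0, 0, 1, 1]   # hypothetical node with an even class mix
pure = [1, 1, 1, 1]       # hypothetical node with a single class

print("entropy: {0:.3f} vs {1:.3f}".format(entropy_impurity(balanced), entropy_impurity(pure)))
print("gini:    {0:.3f} vs {1:.3f}".format(gini_impurity(balanced), gini_impurity(pure)))
```

Both measures are zero for a pure node and largest for an even mix, which is why the two criteria often produce similar trees.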
In [23]:
dt.fit(X_train_fs, y_train)
X_test_fs = fs.transform(X_test)
measure_performance(X_test_fs, y_test, dt, show_confussion_matrix=False, show_classification_report=False)